derivative in the backward propagation. In detail, for the weight of binarized linear layers,
the common practice is to redistribute the weight to zero mean to retain representation
information [199] and to apply scaling factors that minimize the quantization error [199]. The
activation is binarized by the sign function without re-scaling for computational efficiency. Thus, the
computation can be expressed as
\begin{equation}
\text{bi-linear}(X) = \alpha_w \left( \operatorname{sign}(X) \otimes \operatorname{sign}\!\left(W - \mu(W)\right) \right),
\qquad \alpha_w = \frac{1}{n}\,\|W\|_{\ell 1},
\tag{5.3}
\end{equation}
where W and X denote the full-precision weight and activation, μ(·) denotes the mean value,
$\alpha_w$ is the scaling factor for the weight, and ⊗ denotes matrix multiplication with bitwise
xnor and bitcount. In some works, the quantization of the activation X in Eq. (5.3) is set to
higher bit-widths to boost the performance of binarized BERT [6, 222].
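As a concrete illustration, the following PyTorch sketch implements Eq. (5.3) under a few assumptions: the names BinarySign and bi_linear are hypothetical rather than taken from any particular codebase, a straight-through estimator stands in for the approximated derivative mentioned above, and the mean and scaling factor are computed per tensor, whereas some implementations use per-row (per-output-channel) statistics.

```python
import torch


class BinarySign(torch.autograd.Function):
    """sign() in the forward pass, straight-through gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass gradients through only where the input lies in [-1, 1] (clipped STE).
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)


def bi_linear(x, weight):
    """Binarized linear layer of Eq. (5.3): alpha_w * (sign(X) (x) sign(W - mu(W)))."""
    w_centered = weight - weight.mean()      # redistribute the weight to zero mean
    alpha_w = weight.abs().mean()            # alpha_w = (1/n) * ||W||_l1
    bx = BinarySign.apply(x)                 # binarized activation, no re-scaling
    bw = BinarySign.apply(w_centered)        # binarized weight
    # A float matmul stands in for the bitwise xnor/bitcount kernel used at deployment.
    return alpha_w * (bx @ bw.t())
```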
The input data first passes through a quantized embedding layer before being fed into
the transformer blocks [285, 6]. Each transformer block consists of two main components:
the Multi-Head Attention (MHA) module and the Feed-Forward Network (FFN). The
computation of MHA depends on queries Q, keys K, and values V, which are derived from
the hidden states H ∈ $\mathbb{R}^{N \times D}$, where N is the sequence length and D is the
feature dimension. For a specific transformer layer, the computation in an attention
head can be expressed as
\begin{equation}
Q = \text{bi-linear}_Q(H), \qquad
K = \text{bi-linear}_K(H), \qquad
V = \text{bi-linear}_V(H),
\tag{5.4}
\end{equation}
where $\text{bi-linear}_Q$, $\text{bi-linear}_K$, and $\text{bi-linear}_V$ represent three different binarized linear layers
for Q, K, and V, respectively. Then the attention score A is computed as follows:
\begin{equation}
A = \frac{1}{\sqrt{D}} \left( B_Q \otimes B_K^{\top} \right),
\qquad B_Q = \operatorname{sign}(Q), \quad B_K = \operatorname{sign}(K),
\tag{5.5}
\end{equation}
where $B_Q$ and $B_K$ are the binarized query and key, respectively. Note that the obtained
attention scores are then truncated by the attention mask, and each row of A can be regarded
as a k-dimensional vector, where k is the number of unmasked elements. The attention weights
$B_A^s$ are then binarized as
\begin{equation}
B_A^s = \operatorname{sign}\!\left(\operatorname{softmax}(A)\right).
\tag{5.6}
\end{equation}
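A minimal sketch tying Eqs. (5.4)–(5.6) together is given below. It reuses the hypothetical bi_linear helper from the previous sketch; binarized_attention, the random weights W_Q, W_K, W_V, and the boolean attention_mask convention (True = keep) are illustrative assumptions, and torch.sign stands in for the STE-based sign that would be used during training.

```python
import torch


def binarized_attention(Q, K, V, attention_mask=None):
    """Sketch of Eqs. (5.5)-(5.6); attention_mask is a boolean tensor (True = keep)."""
    D = Q.shape[-1]
    BQ = torch.sign(Q)                                     # B_Q = sign(Q)
    BK = torch.sign(K)                                     # B_K = sign(K)
    A = (BQ @ BK.transpose(-2, -1)) / D ** 0.5             # Eq. (5.5)
    if attention_mask is not None:
        A = A.masked_fill(~attention_mask, float("-inf"))  # truncate by the attention mask
    # softmax outputs are non-negative, so sign(.) yields a {0, 1} attention pattern.
    BA = torch.sign(torch.softmax(A, dim=-1))              # Eq. (5.6)
    return BA @ V


# Eq. (5.4): Q, K, V from three separate binarized linear layers
# (hypothetical random weights; bi_linear is the sketch shown earlier).
N, D = 8, 64
H = torch.randn(N, D)
W_Q, W_K, W_V = (torch.randn(D, D) for _ in range(3))
Q, K, V = bi_linear(H, W_Q), bi_linear(H, W_K), bi_linear(H, W_V)
out = binarized_attention(Q, K, V)
```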
Despite the appealing properties of network binarization for relieving the massive parameters
and FLOPs, binarizing BERT is technically hard from an optimization perspective. As
illustrated in Fig. 5.1, the performance of quantized BERT drops only mildly when the bit-width
is reduced from 32-bit to as low as 2-bit, i.e., around 0.6% ↓ on MRPC and 0.2% ↓ on MNLI-m of
the GLUE benchmark [230]. However, when the bit-width is reduced to one, the performance
drops sharply, i.e., ∼3.8% ↓ and ∼0.9% ↓ on the two tasks. In summary, binarization
of BERT brings severe performance degradation compared with higher weight bit-widths.
Therefore, BERT binarization remains a challenging yet valuable task for academia and
industry. This section surveys existing works and advances in binarizing pre-trained BERT
models.